Abstract. Studies have shown that each person is more inclined to
enjoy  a  group  activity  when  1)  she  is  interested  in  the  activity,  and 
2)  many  friends  with  the  same  interest  join  it  as  well.  Nevertheless, 
even  with  the  interest  and  social  tightness  information  available  in  online 
social  networks,  nowadays  many  social  group  activities  still  need  to  be 
coordinated  manually.  In  this  paper,  therefore,  we  first  formulate  a  new 
problem,  named  Participant  Selection  for  Group  Activity  (PSGA),  to 
decide  the  group  size  and  select  proper  participants  so  that  the  sum  of 
personal  interests  and  social  tightness  of  the  participants  in  the  group  is 
maximized,  while  the  activity  cost  is  also  carefully  examined.  To  solve 
the problem, we design a new randomized algorithm, named Budget-Aware
Randomized Group Selection (BARGS), to optimally allocate
the computation budgets for effective selection of the group size and
participants, and we prove that BARGS can acquire the solution with
a guaranteed performance bound. The proposed algorithm was implemented
in Facebook, and experimental results demonstrate that social
groups  generated  by  the  proposed  algorithm  significantly  outperform  the 
baseline  solutions. 


1  Introduction 

Studies have shown that two important factors are usually involved in a person's
decision to join a social group activity: (1) interest in the activity topic
or content, and (2) social tightness with other attendees [5,8]. For example, if a
person who appreciates jazz music has complimentary tickets for a jazz concert
at the Rose Theatre, she is inclined to invite friends, or friends of friends, who are
also jazz enthusiasts. However, even though information on these two factors is now
available online, the attendees of most group activities still need to be selected
manually. The process is tedious and time-consuming, especially for a large social
activity, given the complicated social link structure and the diverse interests of
potential attendees.

Recent studies have explored community detection, graph clustering, and
graph partitioning to identify groups of nodes, mostly based on the graph structure
[1]. The quality of an obtained community is usually measured according
to its internal structure, together with its external connectivity to the rest of
the nodes in the graph [7]. Those approaches are not designed for activity planning
because they do not consider the interests of individual users or
the cost of holding an activity with different numbers of participants. An event
that attracts too few or too many attendees will result in an unacceptable loss
for the planner. Therefore, it is important to incorporate the preference of each
potential participant, their social connectivity, and the activity cost during the
planning of an activity.

With this objective in mind, we formulate a new optimization problem,
named Participant Selection for Group Activity (PSGA). The input consists of
a cost function related to the group size and a social graph G, where each node
represents a potential attendee and is associated with an interest score that
describes the individual level of interest. Each edge has a social tightness score
corresponding to the mutual familiarity between the two persons. Since each
participant is more inclined to enjoy the activity when 1) she is interested in the
activity, and 2) many friends with the same interest join as well, the preference
of a node $v_i$ for the activity can be represented by the sum of its interest score
and the social tightness scores of the edges connecting it to other participants, while
the group preference is the sum of the interest scores of all participants and the
social tightness scores of the edges connecting any two participants. Moreover,
the group utility is the group preference minus
the activity cost (e.g., the expense of food and seating), which is usually correlated
with the number of participants.¹ The objective of PSGA is to determine
the best group size and select proper participants so that the group utility is
maximized. In addition, the subgraph induced by the set F of selected participants
should be connected, so that each attendee can
become acquainted with another attendee along a social path.²

One possible approach to solving PSGA is to examine every possible combination
for every group size. However, enumerating the groups of size k
requires the evaluation of $C^n_k$ candidate groups, where n is the number of nodes
in G. The total number of group size and attendee combinations is therefore $O(2^n)$,
which is not feasible in practical cases. Another approach is to incrementally
construct the group using a greedy algorithm that iteratively tries each
group size and sequentially chooses the attendee that leads to the largest increment
in group utility at each iteration. However, greedy algorithms are inclined
to be trapped in locally optimal solutions. To avoid being trapped in locally optimal
solutions, randomized algorithms have been proposed as a simple but effective
strategy to solve problems with large instances [12].

1 Different weighted coefficients can be assigned to the group utility and activity cost
according to the corresponding scenario.

2 For some group activities, it is not necessary to ensure that F leads to a connected
subgraph. Those scenarios can be handled by adding a virtual node v connected
to every other node in G. Choosing v in F for PSGA always creates a connected
subgraph in G ∪ {v}, but F may not be a connected subgraph in G.
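
As a quick numerical check of the enumeration count discussed in this section, the short Python sketch below sums the number of candidate groups over every group size. The function name is an illustrative assumption, and the count ignores the connectivity constraint, so it is an upper bound on the number of feasible groups.

```python
from math import comb


def count_candidate_groups(n: int) -> int:
    """Number of non-empty node subsets, summed over every group size k."""
    return sum(comb(n, k) for k in range(1, n + 1))


# The sum of binomial coefficients over k = 1..n equals 2^n - 1.
candidates_for_20_nodes = count_candidate_groups(20)
```

Even for a network of 20 nodes the count already exceeds one million, which illustrates why exhaustive enumeration is only usable on very small instances.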

A simple randomized algorithm is to randomly choose multiple start nodes
initially. Each start node is considered a partial solution, and at each later
iteration a node neighboring the partial solution is randomly chosen and added to it.
Nevertheless, this simple strategy has three disadvantages.
First, a start node that has the potential to generate final solutions with
high group utility does not receive sufficient computational resources for randomization
in the following iterations. More specifically, each start node in the
randomized algorithm is expanded to only one final solution. Thus, a good start
node will usually fail to generate a solution with high group utility since it
has only one chance to randomly generate a final solution. The second disadvantage
is that the expansion of the partial solution does not differentiate the selection
of the neighboring nodes. Each neighboring node is treated equally and chosen
uniformly at random in each iteration. Even though this issue can be partially resolved
by assigning the selection probability of each neighboring node according to its
interest score and the social tightness of its incident edges, this assignment leads
to a greedy selection of neighbors and thus tends to be trapped in locally optimal
solutions as well. The third disadvantage is that the linear scanning of different
group sizes is not computationally tractable for real scenarios, as an online social
network contains an enormous number of nodes.

Keeping the above observations in mind, we propose a randomized algorithm,
called Budget-Aware Randomized Group Selection (BARGS), to effectively select
the start nodes, expand the partial solutions, and estimate a suitable group
size. The computational budget represents the target number of random solutions.
Specifically, BARGS first selects a group size limit $k_{max}$ in accordance
with the cost function.³ Afterward, m start nodes are selected, and neighboring
nodes are added to expand each partial solution iteratively until $k_{max}$
nodes are included; the group size corresponding to the largest group utility
is then returned. Each start node in BARGS is expanded to multiple final
solutions according to the assigned budget. To properly invest the computational
budget, each stage of BARGS invests more budget in the start nodes
and group sizes that are more inclined to generate good final solutions, according
to the sampled results from the previous stages. Moreover, the node selection
probability is adaptively assigned in each stage by exploiting the cross entropy
method. In this paper, we show that our allocation of computation budgets is
the optimal strategy, and prove that the solution acquired by BARGS has a
guaranteed performance bound.

The rest of this paper is organized as follows. Section 2 formulates PSGA and
surveys related works. Section 3 explains BARGS and derives the performance
bound. The user study and experimental results are presented in Section 4, and we
conclude this paper in Section 5.

3 For instance, if the largest capacity of available stadiums for a football game is
20,000, $k_{max}$ is set as 20,000.

2  Preliminary

2.1  Problem  Definition 

Given a social network G = (V, E), where each vertex $v_i \in V$ and each edge
$e_{i,j} \in E$ are associated with an interest score $\eta_i$ and a social tightness score $\tau_{i,j}$,
respectively, we study a new optimization problem for finding a set F of vertices
that maximizes the group utility U(F), i.e.,

$$U(F) = \sum_{v_i \in F} \Big( \eta_i + \sum_{v_j \in F : e_{i,j} \in E} \tau_{i,j} \Big) - \beta C(|F|), \qquad (1)$$

where F with $|F| \le k_{max}$ induces a connected subgraph in G, so that each
attendee can become acquainted with another attendee through at least one social path in
F, C is a non-negative activity cost function based on the number of attendees,
and $\beta$ is a weighted coefficient between the preference and the cost. For each node
$v_i$, let $\eta_i + \sum_{v_j \in F : e_{i,j} \in E} \tau_{i,j}$ denote the preference of node $v_i$ for the social
group activity.⁴ PSGA is very challenging due to the tradeoff between interest,
social tightness, and the cost function, and the constraint that F be
connected further complicates the problem because nodes can no longer be chosen
arbitrarily from G. Indeed, we show that PSGA is NP-hard in [15].
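
To make Eq. (1) concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the graph is stored as a nested dictionary `tightness[i][j]` that holds both directions of every edge, so each undirected edge contributes once from each endpoint, as a literal reading of the double sum suggests. The function names, the toy scores, and the linear cost are illustrative assumptions.

```python
from collections import deque


def group_utility(group, interest, tightness, cost, beta=1.0):
    """Evaluate U(F) from Eq. (1) for the node set `group`."""
    members = set(group)
    preference = 0.0
    for i in members:
        preference += interest[i]
        # Tightness of edges from i to other selected members only.
        preference += sum(t for j, t in tightness.get(i, {}).items() if j in members)
    return preference - beta * cost(len(members))


def is_connected(group, tightness):
    """Check with BFS that `group` induces a connected subgraph."""
    members = set(group)
    if not members:
        return False
    start = next(iter(members))
    seen, queue = {start}, deque([start])
    while queue:
        i = queue.popleft()
        for j in tightness.get(i, {}):
            if j in members and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen == members


# Hypothetical toy instance: a path a - b - c.
interest = {"a": 0.5, "b": 0.3, "c": 0.9}
tightness = {"a": {"b": 0.4}, "b": {"a": 0.4, "c": 0.2}, "c": {"b": 0.2}}
linear_cost = lambda k: 0.1 * k  # assumed cost function, for illustration only

utility_ab = group_utility({"a", "b"}, interest, tightness, linear_cost)
```

In this toy instance the set {a, c} would fail the connectivity check because its only link runs through b, which is exactly the situation the connectivity constraint rules out.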

2.2  Related  Works 

A recent line of study has been proposed to find cohesive subgroups in social
networks with different criteria, such as cliques, n-clubs, k-core, and k-plex.
Sarıyüce et al. [14] proposed an efficient parallel algorithm to find a k-core subgraph,
where every vertex is connected to at least k vertices in the subgraph.
Xiang et al. [16] proposed a branch-and-bound algorithm to acquire all maximal
cliques that cannot be pruned during the search tree optimization. Moreover,
finding the maximum k-plexes was comprehensively discussed in [11]. On the
other hand, community detection and graph clustering have been exploited to
identify subgraphs with the desired structures [1]. The quality of a community
is measured according to the structure inside the community and the
structure between the community and the rest of the nodes in the graph, such
as the density of local edges, deviance from a random null model, and conductance
[7]. Nevertheless, the above models do not examine the interest score of
each user and the social tightness scores between users, which have been regarded
as crucial factors for social group activities. Moreover, the activity cost for the
group is not incorporated during the evaluation.

In addition to dense subgraphs, social groups with different characteristics have
been explored for varied practical applications. Expert team formation in social
networks has attracted extensive research interest. The problem of constructing
an expert team is to find a set of people possessing the required skills, while the
communication cost among the chosen people is minimized to optimize the rapport
among the team members and ensure efficient operation. Communication costs
can be represented by the graph diameter, the size of the minimum spanning tree,
and the total length of the shortest paths [9]. Finding influential event organizers
who can influence the largest number of attendees to join an event has been studied [6]. By
contrast, minimizing the total spatial distance, with an R-Tree, from a group with
a given number of nodes to the rally point has also been studied [17]. Nevertheless, this
paper focuses on a different scenario that aims at identifying a group with the most
suitable size according to the activity cost, while the selected participants also
share a common interest and high social tightness.

4 Different weights $\lambda$ and $(1-\lambda)$ can be assigned to the interest scores and social tightness
such that $U(F) = \sum_{v_i \in F} \big( \lambda \eta_i + (1-\lambda) \sum_{v_j \in F : e_{i,j} \in E} \tau_{i,j} \big) - \beta C(|F|)$. $\lambda$ can
be set directly by a user or according to the existing model [18]. The impacts of
different $\lambda$ will be studied later in Section 4.

3  Algorithm  Design  for  PSGA 

To solve PSGA, a baseline approach is to incrementally construct the solution
by sequentially choosing and adding the neighbor node that leads to the largest
increment in the group preference until $k_{max}$ people are selected. Afterward,
we derive the group utility for each k, $1 \le k \le k_{max}$, by incorporating the
activity cost, and extract the group size $k^*$ with the maximum group utility. Due
to the space constraint, the theoretical analysis of the greedy algorithm is presented
in [15].
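
The baseline just described can be sketched in Python as follows. This is an illustrative reading of the greedy procedure rather than the authors' code: it assumes a nested dictionary `tightness[i][j]` storing both directions of each edge, a single given start node, and a hypothetical linear cost. All names are assumptions.

```python
def greedy_group(start, interest, tightness, cost, k_max, beta=1.0):
    """Grow one group greedily from `start`; return the best prefix and its utility."""
    members = [start]
    preference = interest[start]
    best_size, best_utility = 1, preference - beta * cost(1)

    def gain(v):
        # Increment of the group preference in Eq. (1) when v joins: its interest
        # plus the tightness between v and current members, in both directions.
        return interest[v] + sum(
            tightness.get(v, {}).get(u, 0.0) + tightness.get(u, {}).get(v, 0.0)
            for u in members
        )

    while len(members) < k_max:
        frontier = {v for u in members for v in tightness.get(u, {})} - set(members)
        if not frontier:
            break  # the connected component is exhausted
        v = max(sorted(frontier), key=gain)  # sorted() makes ties deterministic
        preference += gain(v)
        members.append(v)
        utility = preference - beta * cost(len(members))
        if utility > best_utility:
            best_size, best_utility = len(members), utility
    return members[:best_size], best_utility


# Hypothetical toy instance: a path a - b - c - d.
interest = {"a": 0.5, "b": 0.3, "c": 0.9, "d": 0.1}
tightness = {
    "a": {"b": 0.4},
    "b": {"a": 0.4, "c": 0.2},
    "c": {"b": 0.2, "d": 0.05},
    "d": {"c": 0.05},
}
group, utility = greedy_group("c", interest, tightness, lambda k: 0.5 * k, k_max=4)
```

Because every prefix of the grown sequence is itself a connected group, the utility for each size k is obtained from a single pass, and the best size is simply the prefix with the largest utility.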

Despite its simplicity, the greedy algorithm explores a limited search space
and thus tends to be trapped in a locally optimal solution,
because only a single sequence of solutions is explored. To address this
issue, this paper proposes a randomized algorithm, BARGS, to randomly choose
m start nodes.⁵ BARGS leverages the notion of Optimal Computing Budget
Allocation (OCBA) [3] to systematically generate the solutions from each start
node, where the start nodes with more potential to generate final solutions
with large group utility are allocated more budget (i.e., expanded to
more final solutions). In addition, since each start node can generate final
solutions with different group sizes, the sizes with larger group utility are
allocated more budget as well (i.e., generated more times). Specifically,
BARGS includes the following two phases.

1) Selection and Evaluation of Start Nodes and Group Sizes: This phase
first  selects  m  start  nodes  according  to  the  summation  of  the  interest  scores 
and  social  tightness  scores  of  incident  edges.  Each  start  node  acts  as  a  seed 
to  be  expanded  to  a  few  final  solutions.  At  each  iteration,  a  partial  solution, 
which  consists  of  only  a  start  node  at  the  first  iteration  or  a  connected  set  of 
nodes  at  each  iteration  afterward,  is  expanded  by  randomly  selecting  a  node 
neighboring to the partial solution, until $k_{max}$ nodes are included. The group
utility  of  each  intermediate  and  final  solution  is  evaluated  to  optimally  allocate 
different  computational  budgets  to  different  start  nodes  and  different  group  sizes 
in  the  next  phase. 
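
The random expansion step of this phase can be sketched as below. This is a minimal illustration under stated assumptions, not the pseudocode of [15]: neighbors are drawn uniformly at random, the graph is a nested dictionary `tightness[i][j]` with both edge directions stored, and the function and variable names are hypothetical.

```python
import random


def random_expansion(start, tightness, k_max, rng):
    """Grow a connected node sequence from `start` by uniform random neighbor picks."""
    members = [start]
    chosen = {start}
    while len(members) < k_max:
        # Nodes adjacent to the partial solution that are not yet selected.
        frontier = sorted({v for u in members for v in tightness.get(u, {})} - chosen)
        if not frontier:
            break
        v = rng.choice(frontier)
        members.append(v)
        chosen.add(v)
    return members


# Hypothetical toy instance: a path a - b - c - d.
tightness = {
    "a": {"b": 0.4},
    "b": {"a": 0.4, "c": 0.2},
    "c": {"b": 0.2, "d": 0.05},
    "d": {"c": 0.05},
}
sample = random_expansion("b", tightness, k_max=3, rng=random.Random(0))
```

Each prefix of the returned sequence is an intermediate solution of a smaller size, so one expansion yields a sampled group for every size from 1 up to the final length.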

5 The impact of m will be studied in Section 4.

2) Allocation of Computational Budgets: This phase is divided into r stages,⁶
and each stage shares the same total computational budget. In the first stage,
the computational budget allocated to each start node is determined by the
sampled group utility in the first phase. In each stage afterward, the computational
budget allocated to each start node is adjusted by the sampled results in
the previous stages. Note that each node can generate different numbers of final
solutions with different group sizes. The sizes with small group utility sampled
in the previous stages will be associated with smaller computational budgets in
the current stage. Therefore, if the activity cost is a convex cost function, the
cost increases more significantly as the group size grows, and BARGS tends to
allocate smaller computational budgets and thus generates fewer final solutions
with large group sizes.

During  the  expansion  of  the  partial  solutions,  we  differentiate  the  probability 
to  select  each  node  neighboring  to  a  partial  solution.  One  intuitive  way  is  to 
associate  each  neighboring  node  with  a  different  probability  according  to  the 
sum  of  the  interest  scores  and  social  tightness  scores  on  the  incident  edges. 
Nevertheless,  this  assignment  is  similar  to  the  greedy  algorithm  as  it  limits  the 
scope  to  only  the  local  information,  making  it  difficult  to  generate  a  final  solution 
with  large  group  utility.  By  contrast,  BARGS  exploits  the  cross  entropy  method 
[13]  according  to  sampled  results  in  the  previous  stages  in  order  to  optimally 
assign  a  probability  to  the  edge  incident  to  a  neighboring  node. 

Due to the space constraint, the detailed pseudocode is presented in [15]. In
the following, we first present how to optimally allocate the computational budgets
to different start nodes and different group sizes. Afterward, we exploit the
cross entropy method to differentiate the neighbor selection during the expansion
of the partial solutions. The performance bound and illustrative example of
the proposed algorithm are provided in the full version [15].


Allocation of Computational Budgets. Similar to the baseline greedy algorithm,
allocating more computational budgets to a start node $v_i$ with larger
group preference (i.e., $\sum_{v_i \in F} ( \eta_i + \sum_{v_j \in F : e_{i,j} \in E} \tau_{i,j} )$) examines only the local
information, and thus it is difficult to generate a solution with large group utility.
Therefore, to optimally allocate the computational budgets for each start
node and each group size, we first define the solution quality as follows.

Definition 1. The solution quality, denoted by Q, is defined as the maximum
group utility of the solutions generated from the m start nodes among all sizes.

For each stage t of phase 2 in BARGS, let $N_{i,k,t}$ denote the computational
budgets allocated to the start node $v_i$ with size k in the t-th stage. In the
following, we first derive the optimal ratio of the computational budgets allocated
to any two start nodes $v_i$ and $v_j$ with group sizes k and l, respectively. Let two
random variables $Q_{i,k}$ and $Q^*_{i,k}$ denote the sampled group utility of any solution
and the maximal sampled group utility of a solution for start node $v_i$ with size k,
respectively. If the activity cost is not considered, according to the central limit
theorem, $Q_{i,k}$ follows the normal distribution when $N_{i,k}$ is large, and it can be
approximated by the uniform distribution in $[c_{i,k}, d_{i,k}]$ as analyzed in OCBA [3],
where $c_{i,k}$ and $d_{i,k}$ denote the minimum and maximum sampled group utility in
the previous stages, respectively. On the other hand, when the activity cost is
considered, the cumulative distribution function is shifted by C(k), and it still
follows the same distribution. Therefore, we have the following lemma.

6 The detailed settings of the parameters of the algorithm, such as m, r, $\alpha$, and $\beta$, are
presented in [15].


Lemma 1. Assume that $d_{j,l} > c_{i,k}$. The probability that the solution generated
from the start node $v_i$ with size k is better than the solution generated from the
start node $v_j$ with size l, i.e., $P(Q^*_{i,k} \ge Q^*_{j,l})$, is at least $(\cdots)^{N_{i,k}}$.

Proof.  Due  to  the  space  constraint,  the  detailed  proof  is  presented  in  [15]. 


Let $v_b$ and $k_b^*$ denote the best start node and the best activity size for $v_b$, respectively.
The ratio between $N_{i,k,t}$ and $N_{j,l,t}$ equals $P(Q^*_{i,k} \ge Q^*_{b,k_b^*}) : P(Q^*_{j,l} \ge
Q^*_{b,k_b^*})$, which is optimal as shown in OCBA [3]. However, the computational
costs for different group sizes are not the same, e.g., the computational cost of
the total group utility for size 1 is much smaller than the computational cost
for size 100. Since the computational complexity of adding a node to a partial
solution of size $k - 1$ is O(k), we derive the ratio of the computational budgets
between $N_{i,k,t}$ and $N_{j,l,t}$ as follows.

$$\frac{N_{i,k,t}}{N_{j,l,t}} = \frac{\frac{1}{k} \cdot P(Q^*_{i,k} \ge Q^*_{b,k_b^*})}{\frac{1}{l} \cdot P(Q^*_{j,l} \ge Q^*_{b,k_b^*})}. \qquad (2)$$


Note that if the allocated computational budget for a start node is 0 in the
t-th stage, we prune the start node in every stage afterward. Moreover,
when we generate a solution with group size k, the solutions from size 1 to size
$k - 1$ are generated as well. Therefore, to avoid generating an excess number
of solutions with small group sizes, it is necessary to reallocate the computation
budgets. Let $\hat{N}_{i,k,t}$ denote the reallocated budget of start node $v_i$ with size k in
the t-th stage. BARGS derives $\hat{N}_{i,k,t}$ as follows.

$$\hat{N}_{i,k,t} = \max\Big(0,\; N_{i,k,t} - \sum_{l > k} \hat{N}_{i,l,t}\Big). \qquad (3)$$

Specifically, after deriving $N_{i,k,t}$ with Eq. 2, BARGS derives $\hat{N}_{i,k,t}$ from $k = k_{max}$
down to 1. Initially, $\hat{N}_{i,k_{max},t} = N_{i,k_{max},t}$. Afterward, for $k = k_{max} - 1$, if $N_{i,k_{max}-1,t}$ is
equal to $N_{i,k_{max},t}$, it is not necessary to generate additional solutions with size
$k_{max} - 1$ since they have been created during the generation of the solutions with
size $k_{max}$. In this case, $\hat{N}_{i,k_{max}-1,t}$ is 0. Otherwise, BARGS sets $\hat{N}_{i,k_{max}-1,t} =
N_{i,k_{max}-1,t} - \hat{N}_{i,k_{max},t}$. The above process repeats until k = 1. Since the number
of solutions with size k is still $N_{i,k,t}$, the computational budget allocation is still
optimal as shown in Eq. 2.
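
The reallocation procedure for one start node can be sketched as a single backward pass. The sketch assumes that Eq. 3 subtracts the budgets already reallocated to larger sizes, which matches the step-by-step description from $k_{max}$ down to 1; the function name and the sample budgets are illustrative assumptions.

```python
def reallocate_budgets(budgets):
    """budgets[k - 1] is the budget for size k (k = 1..k_max); return reallocated budgets."""
    k_max = len(budgets)
    reallocated = [0] * k_max
    covered = 0  # solutions of larger sizes already scheduled; each also yields a size-k prefix
    for k in range(k_max, 0, -1):
        reallocated[k - 1] = max(0, budgets[k - 1] - covered)
        covered += reallocated[k - 1]
    return reallocated


# Hypothetical budgets for sizes 1..4 of one start node.
new_budgets = reallocate_budgets([10, 6, 6, 2])
```

In this example the result is 4, 0, 4, and 2 solutions for sizes 1 to 4. Summing the entries for all sizes at least k recovers the original budget for size k (10, 6, 6, 2), so no size receives fewer samples than Eq. 2 prescribes.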

Neighboring Node Differentiation. To effectively differentiate neighbor selection,
BARGS exploits the cross entropy method [13] to achieve importance sampling
by adaptively assigning a different probability to each neighboring node from
the sampled results in previous stages.


Definition 2. A Bernoulli sample vector, denoted as $X_{i,k,q} = (x_{i,k,q,1}, \ldots, x_{i,k,q,j},
\ldots, x_{i,k,q,n})$, is defined to be the q-th sample vector from start node $v_i$, where $x_{i,k,q,j}$
is 1 if node $v_j$ is selected in the q-th sample and 0 otherwise.


Take start node $v_i$ with size k as an example. After collecting $N_{i,k,t}$ samples
$X_{i,k,1}, X_{i,k,2}, \ldots, X_{i,k,q}, \ldots, X_{i,k,N_{i,k,t}}$ generated from start node $v_i$, BARGS calculates
the total group utility $U(X_{i,k,q})$ for each sample and sorts them in
descending order, $U_{(1)} \ge \ldots \ge U_{(N_{i,k,t})}$. Let $\gamma_{i,k,t}$ denote the group utility of the
top-$\rho$ performance sample, i.e., $\gamma_{i,k,t} = U_{(\lceil \rho N_{i,k,t} \rceil)}$. With those sampled results,
we set the selection probability $p_{i,k,t+1,j}$ of every node $v_j$ in stage t + 1 for
the partial solution expanded from node $v_i$ by fitting the distribution of top-$\rho$
performance samples as follows.


$$p_{i,k,t+1,j} = \frac{\sum_{q=1}^{N_{i,k,t}} I_{\{U(X_{i,k,q}) \ge \gamma_{i,k,t}\}} \, x_{i,k,q,j}}{\sum_{q=1}^{N_{i,k,t}} I_{\{U(X_{i,k,q}) \ge \gamma_{i,k,t}\}}}, \qquad (4)$$


where $I_{\{U(X_{i,k,q}) \ge \gamma_{i,k,t}\}}$ is 1 if the group utility of sample $X_{i,k,q}$ is no smaller
than a threshold $\gamma_{i,k,t} \in \mathbb{R}$, and 0 otherwise. Intuitively, the neighbor that tends
to generate a better solution will be assigned a higher selection probability. As
shown in [13], the above probability assignment scheme has been proved to be
optimal from the perspective of cross entropy. Eq. 4 minimizes the Kullback-Leibler
cross entropy (KL) distance between the node selection probability and the
distribution of top-$\rho$ performance samples, such that the performance of random
samples in the (t + 1)-th stage is guaranteed to be closest to the top-$\rho$ performance
samples in the t-th stage. Due to the space constraint, the illustrative
example and theoretical results are provided in the full version [15].
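
The probability update of Eq. 4 amounts to averaging the 0/1 sample vectors of the elite samples. The Python sketch below illustrates this under stated assumptions: the threshold is taken as the utility of the sample ranked at position ceil(rho * N), no smoothing between stages is applied, and the function name and toy data are hypothetical.

```python
import math


def cross_entropy_update(samples, utilities, rho):
    """Return per-node selection probabilities fitted to the top-rho samples.

    samples   : list of 0/1 vectors, one per sampled group
    utilities : group utility of each sample, in the same order
    rho       : fraction of samples treated as elite (0 < rho <= 1)
    """
    n_samples = len(samples)
    ranked = sorted(utilities, reverse=True)
    # Threshold: utility of the ceil(rho * N)-th best sample.
    gamma = ranked[max(1, math.ceil(rho * n_samples)) - 1]
    elite = [x for x, u in zip(samples, utilities) if u >= gamma]
    n_nodes = len(samples[0])
    # Fraction of elite samples that contain each node.
    return [sum(x[j] for x in elite) / len(elite) for j in range(n_nodes)]


# Hypothetical samples over three nodes with their sampled utilities.
samples = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 0]]
utilities = [3.0, 1.0, 2.5, 0.5]
probabilities = cross_entropy_update(samples, utilities, rho=0.5)
```

Here the two best samples both contain the first and second nodes and neither contains the third, so the next stage would favor the first two nodes. In practice a smoothing step is commonly used to keep probabilities away from exactly 0 or 1; that step is omitted from this sketch.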

Time Complexity of BARGS. The time complexity of BARGS contains two
parts. The first phase selects m start nodes in $O(E + n + m \log n)$ time, where
$O(E)$ is to sum up the interest and social tightness scores, and $O(n + m \log n)$ is to
build a heap and extract the m nodes with the largest sums. Afterward, the second
phase of BARGS includes r stages, and each stage allocates the computational
resources in $O(m)$ time and generates $O(T/r)$ new partial solutions with at most
$k_{max}$ nodes for all start nodes, where T denotes the total computational budget.
Therefore, the time complexity of the second
phase is $O(r(m + \frac{T}{r} k_{max})) = O(k_{max} T)$, and BARGS therefore needs $O(E +
m \log n + k_{max} T)$ time.
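
The heap-based selection of start nodes in the first phase can be sketched with Python's standard `heapq` module. This is an illustration of the complexity argument, not the authors' code; the function name, the nested-dictionary graph layout, and the toy scores are assumptions.

```python
import heapq


def select_start_nodes(interest, tightness, m):
    """Return the m nodes with the largest interest plus incident tightness."""
    # O(E + n): one pass over nodes and their incident edges.
    scores = {v: interest[v] + sum(tightness.get(v, {}).values()) for v in interest}
    # O(n): build a min-heap on negated scores, which acts as a max-heap.
    heap = [(-score, v) for v, score in scores.items()]
    heapq.heapify(heap)
    # O(m log n): extract the m best nodes.
    return [heapq.heappop(heap)[1] for _ in range(min(m, len(heap)))]


# Hypothetical toy instance: a path a - b - c - d.
interest = {"a": 0.6, "b": 0.3, "c": 0.9, "d": 0.1}
tightness = {
    "a": {"b": 0.4},
    "b": {"a": 0.4, "c": 0.2},
    "c": {"b": 0.2, "d": 0.05},
    "d": {"c": 0.05},
}
start_nodes = select_start_nodes(interest, tightness, m=2)
```

Building the heap once and popping m times matches the $O(n + m \log n)$ term in the analysis, whereas fully sorting all scores would cost $O(n \log n)$.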


4  Experimental  Results 

We implement BARGS in Facebook and invite 50 people from various communities,
e.g., schools, government, technology companies, and businesses, to join
our user study. We compare the solution quality and running time of manual
coordination and BARGS for answering PSGA problems, to evaluate the need for
an automatic group recommendation service. Each user is asked to plan 5 social
activities with the social graphs extracted from their social networks in Facebook.
The interest scores follow the power-law distribution with the exponent
2.5 according to the recent analysis [4] on real datasets. The social tightness
score between two friends is derived according to the number of common friends,
which represents the proximity interaction [2], and the probability of negative
weights [10]. Then, the weighted coefficient $\lambda$ on social tightness scores and interest
scores and the weighted coefficient $\beta$ on group preference and activity cost in
Footnote 4 are set as the average values specified by the 50 people, i.e., $\lambda$ = 0.527
and $\beta$ = 0.514. Most importantly, after the scores are returned by the above
renowned models, each user is allowed to fine-tune the two scores by themselves.
In addition to the user study, two real datasets are evaluated in the experiment.
The first dataset is crawled from Facebook with 90,269 users in the New Orleans
network.⁷ The second dataset is crawled from DBLP with 511,163 nodes
and 1,871,070 edges.

In this paper, the activity cost is modelled by a piecewise linear function,
which can approximate any non-decreasing function. We set the activity cost
according to the auditorium cost and other related costs in Duke Energy Center.⁸


$$C(k) = \begin{cases} 400 - k & \text{if } 0 < k \le 100, \\ 850 - k & \text{if } 100 < k \le 600, \\ 2200 - k & \text{if } 600 < k \le 1750. \end{cases}$$

We compare deterministic greedy (DGreedy), randomized greedy (RGreedy),
and BARGS on an HP DL580 server with four Intel E7-4870 2.4 GHz CPUs and
128 GB RAM. RGreedy first chooses the same m start nodes as BARGS. At each
iteration, RGreedy calculates the preference increment of adding a neighboring
node $v_j$ to the intermediate solution $V_S$ obtained so far for each neighboring
node, and sums them up as the total preference increment. Afterward, RGreedy
sets the node selection probability of each neighbor as the ratio of the corresponding
preference increment to the total preference increment, similar to the
concept in the greedy algorithm. Notice that the computation budgets represent
the number of generated solutions. With more computation budgets, RGreedy
generates more solutions of group size $k_{max}$, examines the group utility by subtracting
the activity cost for group sizes 1 to $k_{max}$, and selects the group with
maximum group utility. It is worth noting that RGreedy is computationally
intensive and does not scale to large group sizes because it is necessary
to sum up the interest scores and social tightness scores during the selection of
a node neighboring each partial solution. Therefore, we can only present the
results of RGreedy with small group sizes. Due to the space constraint, detailed
experimental results of the DBLP dataset are presented in [15].
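
The neighbor selection rule of RGreedy described above can be sketched as follows. This is a hedged illustration rather than the evaluated implementation: it assumes a nested dictionary `tightness[i][j]` storing both edge directions, preference increments that sum to a strictly positive total, and hypothetical function names and toy scores.

```python
import random


def neighbor_probabilities(members, interest, tightness):
    """Selection probability of each frontier node, proportional to its preference increment."""
    members = set(members)
    frontier = sorted({v for u in members for v in tightness.get(u, {})} - members)
    increments = {
        v: interest[v] + sum(
            tightness.get(v, {}).get(u, 0.0) + tightness.get(u, {}).get(v, 0.0)
            for u in members
        )
        for v in frontier
    }
    total = sum(increments.values())  # assumed strictly positive in this sketch
    return {v: inc / total for v, inc in increments.items()}


def pick_neighbor(members, interest, tightness, rng):
    """Draw one frontier node according to the probabilities above."""
    probs = neighbor_probabilities(members, interest, tightness)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[v] for v in nodes], k=1)[0]


# Hypothetical toy instance: a path a - b - c - d with partial solution {c}.
interest = {"a": 0.5, "b": 0.3, "c": 0.9, "d": 0.1}
tightness = {
    "a": {"b": 0.4},
    "b": {"a": 0.4, "c": 0.2},
    "c": {"b": 0.2, "d": 0.05},
    "d": {"c": 0.05},
}
probs = neighbor_probabilities({"c"}, interest, tightness)
picked = pick_neighbor({"c"}, interest, tightness, random.Random(0))
```

Recomputing every increment against the whole partial solution at each step is the summation cost that the text identifies as the reason RGreedy does not scale to large group sizes.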

7  http://socialnetworks.mpi-sws.org/data-wosn2009.html 

8 http://www.dukeenergycenterraleigh.com/uploads/venues/rental/5-rateschedule.pdf

Fig. 1. Results of user study


The default m in the experiments is set to n/kmax, since n/kmax groups can
be acquired from a network with n nodes if each group has kmax participants.
The default cross-entropy parameters ρ and α are set to 0.3 and 0.99,
respectively, as recommended for the cross-entropy method [13]. Since BARGS
natively supports parallelization, we also implemented it with OpenMP to
demonstrate the gain from additional CPU cores.
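To make the role of the two cross-entropy parameters concrete, the following is a sketch of one textbook cross-entropy update with smoothing. It is not necessarily the exact update rule used in BARGS: the names rho (elite fraction) and alpha (smoothing weight), the binary encoding of a solution, and the function name are assumptions for illustration.

```python
def cross_entropy_update(samples, scores, old_probs, rho=0.3, alpha=0.99):
    """One smoothed cross-entropy update of per-node selection probabilities.

    samples:   list of 0/1 lists, one per generated solution (assumed encoding)
    scores:    utility of each sample, higher is better
    old_probs: current selection probability of each node
    """
    # Keep the top rho fraction of the samples as the elite set (at least one).
    n_elite = max(1, int(rho * len(samples)))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    elite = [samples[i] for i in ranked[:n_elite]]

    new_probs = []
    for j, old in enumerate(old_probs):
        # Fraction of elite samples that selected node j.
        freq = sum(s[j] for s in elite) / n_elite
        # Smoothing: blend the elite frequency with the previous probability.
        new_probs.append(alpha * freq + (1.0 - alpha) * old)
    return new_probs


# Toy data: 10 samples over 3 nodes; the three best samples all select node 0.
samples = [[1, 1, 0], [1, 0, 0], [1, 0, 0]] + [[0, 1, 1]] * 7
scores = [9.0, 8.0, 7.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
updated = cross_entropy_update(samples, scores, [0.5, 0.5, 0.5])
```

Under this assumed rule, a smoothing weight of 0.99 means the new probabilities follow the elite samples almost entirely, while the small remaining weight keeps every probability strictly between 0 and 1.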


4.1  User  Study 

Figures 1(a)-(c) compare manual coordination and BARGS in the user study. In
addition, the optimal solution is derived with the enumeration method since
the network size is very small. Figures 1(a) and (b) present the solution quality
and execution time with different network sizes. The results indicate that the
solutions obtained by BARGS are identical to the optimal solutions, whereas users
are not able to find the optimal solutions even when n = 5. As n increases,
the solution quality of manual coordination degrades rapidly. We also compare
the accuracy of selecting the optimal group size in Figure 1(c). As n increases,
it becomes more difficult for a user to correctly identify the optimal size, while
BARGS always selects the optimal one. Therefore, it is desirable to deploy
BARGS as an automatic group recommendation service, especially to address
the need for large groups in today's massive social networks.
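For very small networks, the enumeration baseline mentioned above can be sketched as follows. The utility form (sum of interests plus pairwise tightness inside the group, minus a size-dependent cost), the symmetric tightness scores, the linear cost function, and the absence of a connectivity constraint are all simplifying assumptions for illustration, not the exact PSGA definition.

```python
from itertools import combinations


def group_utility(group, interest, tightness, cost):
    """Assumed utility: interests + pairwise tightness in the group - cost(size)."""
    total = sum(interest[v] for v in group)
    total += sum(tightness.get(frozenset(pair), 0.0)
                 for pair in combinations(group, 2))
    return total - cost(len(group))


def best_group_by_enumeration(nodes, interest, tightness, cost, k_max):
    """Try every group of size 1..k_max and keep the one with maximum utility."""
    best, best_u = None, float("-inf")
    for k in range(1, k_max + 1):
        for group in combinations(nodes, k):
            u = group_utility(group, interest, tightness, cost)
            if u > best_u:
                best, best_u = group, u
    return best, best_u


# Toy instance with hypothetical scores.
nodes = ["a", "b", "c", "d"]
interest = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.1}
tightness = {frozenset(("a", "b")): 1.0,
             frozenset(("b", "c")): 0.5,
             frozenset(("c", "d")): 0.3}
best, best_u = best_group_by_enumeration(
    nodes, interest, tightness, cost=lambda k: 0.7 * k, k_max=4)
```

The number of candidate groups grows exponentially with n, which is why such enumeration is only feasible at the network sizes used in the user study.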

4.2  Performance  Comparison  and  Sensitivity  Analysis 

Figure 2(a) compares the execution time of DGreedy, RGreedy, and BARGS by
sampling different numbers of nodes from the Facebook data. DGreedy is always the
fastest since it is a deterministic algorithm and generates only one final
solution, whereas RGreedy requires more than 10^5 seconds. RGreedy does not
return results within 2 days when n increases to 10000. To evaluate the
performance of BARGS with multi-threaded processing, Figure 2(b) shows that
8 threads speed up processing by 7.2 times. The speedup is slightly lower than 8
because OpenMP does not allow different threads to write to the same memory
position at the same time. Therefore, BARGS with parallelization is promising
for deployment as a value-added cloud service.
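The reported speedup can be put in perspective with a short calculation. The use of Amdahl's law here is our own illustrative assumption and is not part of the original analysis; it simply estimates what serial fraction would explain a speedup of 7.2 on 8 threads.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of the ideal linear speedup that is actually achieved."""
    return speedup / threads


def amdahl_serial_fraction(speedup, threads):
    """Serial fraction s implied by Amdahl's law:
    speedup = 1 / (s + (1 - s) / threads), solved for s."""
    return (threads / speedup - 1.0) / (threads - 1.0)


efficiency = parallel_efficiency(7.2, 8)          # 0.9
serial_fraction = amdahl_serial_fraction(7.2, 8)  # roughly 0.016
```

Under this assumption, a speedup of 7.2 on 8 threads corresponds to 90% parallel efficiency and a serialized portion of under 2% of the work.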

In addition to the running time, Figure 2(c) compares the solution quality
of different approaches. The results indicate that BARGS outperforms DGreedy
and RGreedy, especially under a large n. The group utility of BARGS is 45%
better than that of DGreedy when n = 50000. On the other hand, RGreedy
outperforms DGreedy since it has a chance to escape a local optimum.

Fig. 2. Experimental results on Facebook and DBLP datasets

Figures 2(d) and (e) compare the execution time and solution quality of the two
randomized approaches under different total computational budgets, i.e., T. As
T increases, the solution quality of BARGS increases faster than that of RGreedy
because it can optimally allocate the computation resources. Even though the
solution quality of RGreedy is close to that of BARGS in some cases, BARGS is
much faster, with a running time on the order of 10^-2 of that of RGreedy.

Figures  2(f)  and  (g)  present  the  execution  time  and  solution  quality  of 
RGreedy  and  BARGS  with  different  numbers  of  start  nodes,  i.e.,  m.  The  results 
show that the solution quality in Figure 2(g) is almost the same as m increases,
demonstrating that it is sufficient for m to be set as a value smaller than
n/kmax, as recommended by OCBA [3]. The running time of BARGS for m = 2 is only
60%  of  the  running  time  for  m  =  4  as  shown  in  Figure  2(f),  while  the  solution 
quality  remains  almost  the  same. 

BARGS is also evaluated on the DBLP dataset. Figures 2(h) and (i) show
that BARGS outperforms DGreedy by 50% and RGreedy by 26% in solution quality
when n = 500000. The running time of BARGS is again on the order of 10^-2 of
that of RGreedy. However, RGreedy runs faster on the DBLP dataset than on the
Facebook dataset, because the DBLP dataset is a sparser graph with an average
node degree of 3.66. Therefore, the number of candidate nodes to be chosen
during the expansion of the partial solution increases much more slowly in the
DBLP dataset than in the Facebook dataset, whose average node degree is 26.1.
Nevertheless, RGreedy is still not able to generate a solution for a large
network size n due to its poor efficiency.

5  Conclusion 

To the best of our knowledge, there is no real system or existing work in the
literature that addresses the issues of scale-adaptive group optimization for
social activity planning based on topic interest, social tightness, and activity
cost. To fill this research gap and satisfy an important practical need, this
paper formulated a new optimization problem called PSGA to derive a set of
attendees and maximize the group utility. We proved that PSGA is NP-hard and
devised a simple but effective randomized algorithm, namely BARGS, with a
guaranteed performance bound. The user study demonstrated that the social groups
obtained through the proposed algorithm implemented in Facebook significantly
outperform the solutions manually configured by users. This research result thus
holds much promise to be profitably adopted in social networking websites as a
value-added service.